Abstract

Pedigree graphs, or family trees, are typically constructed by an expensive process of exam- 
ining genealogical records to determine which pairs of individuals are parent and child. New 
methods to automate this process take as input genetic data from a set of extant individuals 
and reconstruct ancestral individuals. There is a great need to evaluate the quality of these 
methods by comparing the estimated pedigree to the true pedigree. 

In this paper, we consider two main pedigree comparison problems. The first is the pedigree 
isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. 
The second is the pedigree edit distance problem, for which we present 1) several algorithms that 
are fast and exact in various special cases, and 2) a general, randomized heuristic algorithm. 

In the negative direction, we first prove that the pedigree isomorphism problem is as hard 
as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is 
NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and 
NP-hard on leaf-labeled pedigrees. 

We use simulated pedigrees to compare our edit-distance algorithms to each other as well as 
to a branch-and-bound algorithm that always finds an optimal solution. 
1 Introduction



Pedigrees, or family trees, are of interest in a variety of fields. They are interesting to geneticists
due to the accuracy with which recombinations can be inferred [11] and with which disease loci
can be mapped [2SII21]. Likelihood calculations, i.e., calculations of the probability of observing
the data inherited in a given pedigree, are of great interest for mapping disease loci. Pedigrees
are objects of interest in computer science due to their close connection with machine learning
methods [l9j[T2]. Many calculations on pedigree graphs are hard [261 E01 US], but notable attempts
have been made to improve the speed of these calculations [3 H31 ED].

Reconstructing pedigrees is thus an interesting but difficult problem. Genealogical methods
for reconstructing pedigrees can involve multiple sources with contradictory or missing
information [151 EH EH E2J [5]. Due to the error-prone nature of genealogical pedigree reconstruction
and the unavailability of genealogical data for some animals, the pedigree reconstruction problem
was introduced by Thompson [35] as follows: given genetic data for a set of extant individuals,
reconstruct relationships between those individuals that may involve unobserved ancestors.
State-of-the-art practical methods include [351 GHJ HI El H7] and theoretical work includes [Ml [33].

Evaluating reconstruction methods requires inferring a pedigree on an instance for which the 
true pedigree is known and comparing the inferred and known pedigrees. Both the estimated pedi- 
gree and the true pedigree will have the same set of extant individuals — i.e. the individuals having 
genetic data — but may have different inferred ancestors. Thus, to compare these two pedigrees, we 
must compare their topology in a fashion that respects the labels of the extant individuals. 

Existing methods of comparing pedigrees are insufficient. For example, phylogenetic tree
comparison methods can only be used to compare tree-like pedigrees, but pedigrees can take more
general forms. As another example, [17] evaluates accuracy using the kinship coefficient of all
pairs of individuals, where the kinship coefficient of two individuals at a single locus is the average 
number of alleles inherited from the same ancestor. This is a poor accuracy measure, since the kin- 
ship coefficient is not identifiable. For example, two half-siblings have the same kinship coefficient 
as an uncle and nephew. A recent result demonstrates non-identifiability of larger pedigrees [27].
Furthermore, the pedigree likelihood is not an acceptable pedigree comparison method both be- 
cause it requires an exponential-time algorithm and because the non-identifiability of the kinship 
coefficient may well imply the non-identifiability of the likelihood. While pedigree isomorphisms 
were discussed very briefly by Steel and Hein [30] in the context of pedigree reconstruction, they
did not discuss the pedigree isomorphism problem and its algorithms. Finally, brute force methods 
of pedigree comparison are not sufficient, as can be seen in our own brute-force comparison where 
the simulation was limited to pedigrees of fourteen individuals. This is in contrast to pedigree 
data sets which include thousands of individuals [HE2]. Other biologists are collecting data from
hundreds of individuals from large families (e.g., 60 individuals per family for salmon [14, 3]), where
brute-force and likelihood methods fall short due to exponential running times.

Two natural formulations of the pedigree comparison problem are discussed in this paper: 
pedigree isomorphism and pedigree edit distance. Two pedigrees are isomorphic if there exists a 
graph isomorphism which respects the genders of all individuals and the identities of the individuals 
for which we have genetic data; the pedigree isomorphism problem is to determine whether two 
pedigrees are isomorphic. The more difficult pedigree edit distance problem is to determine how 
many edge insertions and deletions are needed for two pedigrees to become isomorphic. 

In this paper, we formalize the isomorphism and edit distance problems, provide useful
algorithms for certain instances of these problems, and give four hardness results. The algorithms
we present include a fast algorithm for leaf-labeled pedigree isomorphism and a polynomial-time
dynamic programming algorithm for the edit distance of sufficiently similar pedigrees. For general
pedigrees, we present a heuristic algorithm for the edit distance.

Figure 1: An Example Pedigree. This pedigree has edges implicitly directed downward from
parent to child and shows two founding grandparents, their four children, two inbred
grandchildren and two inbred great-grandchildren. Each edge represents the transmission from parent i
to offspring j of a single (possibly recombinant) copy of each chromosome.

Our first hardness result is that pedigree isomorphism is as hard as general graph isomorphism, 
making it GI-hard. The reduction we use also shows that sub-pedigree isomorphism is as hard as
sub-graph isomorphism, making it NP-hard. The third and fourth hardness results show APX- 
hardness for the edit distance problem in general and NP-hardness on pedigrees whose leaves are 
all labeled. 

2 Preliminaries 

A pedigree $V = (P, s, X, \ell)$ consists of a pedigree graph $P = (I(P), E(P))$ with vertices $I(P)$ and
edges $E(P)$, a gender function $s : I(P) \to \{m, f\}$, a set $X \subseteq I(P)$ of labeled individuals, and
an injective labeling $\ell : X \to \mathbb{N}$, such that:

1. P is directed and acyclic.

2. For all $v \in I(P)$, the in-degree of v is either two or zero.

3. If $(a, v), (b, v) \in E(P)$, then $s(a) \neq s(b)$.

P is called the pedigree graph of V. Vertices of P are called the individuals of V. Individuals
with in-degree zero are founders while individuals with in-degree two are non-founders.
Individuals with out-degree zero are called leaf individuals. For an individual $x \in X$, $\ell(x)$ is the label
assigned to x. We will sometimes write $V = (P, s)$ to mean $V = (P, s, \emptyset, \ell)$ with trivial $\ell$. When
$X = I(P)$, we will say that V is fully labeled. If the reference pedigree is clear, we may write I
to refer to I(P). Figure 1 depicts an example of a fully labeled pedigree.
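As an illustration of the definition, the following Python sketch checks the three conditions on a candidate pedigree graph. The representation (a set of `(parent, child)` pairs plus a `sex` dictionary) and the function name `is_valid_pedigree` are assumptions made for this example and are not part of the paper.

```python
from collections import defaultdict

def is_valid_pedigree(individuals, edges, sex):
    """Check conditions 1-3 of the pedigree definition (illustrative sketch).

    individuals: iterable of hashable ids
    edges: set of (parent, child) pairs
    sex: dict mapping each individual to 'm' or 'f'
    """
    individuals = set(individuals)
    parents = defaultdict(list)
    children = defaultdict(list)
    for p, c in edges:
        if p not in individuals or c not in individuals:
            return False
        parents[c].append(p)
        children[p].append(c)

    # Conditions 2 and 3: in-degree is 0 or 2, and the two parents differ in gender.
    for v in individuals:
        ps = parents[v]
        if len(ps) not in (0, 2):
            return False
        if len(ps) == 2 and sex[ps[0]] == sex[ps[1]]:
            return False

    # Condition 1: acyclicity, checked by repeatedly removing vertices whose
    # parents have all been removed (Kahn's algorithm).
    indeg = {v: len(parents[v]) for v in individuals}
    ready = [v for v in individuals if indeg[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return removed == len(individuals)
```

The labeled set X and the labeling are omitted here because the three conditions constrain only the graph and the gender function.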

The labeled individuals $X \subseteq I$ typically have an available DNA sample on which genotyping
or sequencing can be performed. While it may be algorithmically convenient to assume that we 
have samples for all individuals, X = I, this assumption is quite impractical. Very often, there are 
individuals represented in the pedigree who are deceased and for whom samples are unavailable. 
There are circumstances in which there will be no labeled individuals, i.e., $X = \emptyset$. Then we would
want to rely on the genealogical structure alone in determining whether the same individuals appear 
in the two pedigree graphs. 
A pedigree $V = (P, s, X, \ell)$ is leaf-labeled if X contains exactly the leaf individuals of V. Very
often this is the case since, for likelihood calculations, one is only interested in the individuals for
whom identifying data exist.

Two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$ are compatibly leaf-labeled if both
are leaf-labeled and for every leaf individual $v \in I(P)$, there is a leaf individual $v' \in I(P')$ such
that $\ell(v) = \ell'(v')$, and vice versa. Given two leaf-labeled pedigree graphs P and Q, we can always
obtain new pedigree graphs P' and Q' that are compatibly leaf-labeled. This is done by performing
a depth-first search starting at each leaf individual whose label appears in both pedigrees, following
parent-child edges backward in time, and then pruning those individuals which were not visited by the
depth-first search. This is useful because it allows us to detect an isomorphic subgraph which is
compatibly leaf-labeled.
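The pruning step can be sketched as follows. This is a minimal sketch under the assumption that a pedigree graph is given as a set of `(parent, child)` pairs and that `labels` maps leaf individuals to their labels; the function name `prune_to_shared_leaves` is hypothetical.

```python
def prune_to_shared_leaves(edges, labels, shared_labels):
    """Keep only the individuals reached by walking child -> parent from the
    leaves whose label lies in shared_labels (illustrative sketch).

    edges: set of (parent, child) pairs
    labels: dict mapping leaf individuals to labels
    shared_labels: set of labels present in both pedigrees
    Returns (kept_individuals, kept_edges).
    """
    parents = {}
    for p, c in edges:
        parents.setdefault(c, []).append(p)

    # Iterative depth-first search backward in time from each shared leaf.
    stack = [v for v, lab in labels.items() if lab in shared_labels]
    kept = set()
    while stack:
        v = stack.pop()
        if v in kept:
            continue
        kept.add(v)
        stack.extend(parents.get(v, []))

    # Prune every edge that touches an unvisited individual.
    kept_edges = {(p, c) for p, c in edges if p in kept and c in kept}
    return kept, kept_edges
```

Calling this on both pedigrees with `shared_labels` set to the intersection of their label sets yields two graphs whose leaves carry exactly the shared labels.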

A pedigree gives rise to sub-pedigrees in the following way. If two pedigrees $V = (P, s, X, \ell)$
and $V' = (P', s', X', \ell')$ satisfy:

1. $I(P') \subseteq I(P)$ and P' is the subgraph of P induced by $I(P')$,

2. $X' = X \cap I(P')$,

3. $\ell' = \ell|_{I(P')}$, and

4. $s' = s|_{I(P')}$,

then we say that V' is a sub-pedigree of V, and write $V' \subseteq V$.

Given a set $A \subseteq I(P)$, we will write $V|_A$ to denote the minimal sub-pedigree of V containing
the vertices in A. Likewise, if $A \subseteq E(P)$, we write $V|_A$ to denote the minimal sub-pedigree of V
containing the edges in A.

There are two additional types of pedigrees in which we are interested: monogamous and
generational pedigrees. A pedigree $V = (P, s, X, \ell)$ is monogamous if all the individuals are
monogamous. An individual $v \in I(P)$ is monogamous if the number of individuals $v' \neq v$ such
that $(v, u), (v', u) \in E(P)$ for some $u \in I(P)$ is at most 1.
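The monogamy condition can be checked directly from the edge set. The sketch below assumes the graph is given as a set of `(parent, child)` pairs; the name `is_monogamous` is chosen for this example only.

```python
def is_monogamous(edges):
    """Return True if every individual shares children with at most one
    other individual (illustrative sketch).

    edges: set of (parent, child) pairs
    """
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)

    # mates[v] collects every other individual that co-parents a child with v.
    mates = {}
    for ps in parents.values():
        for p in ps:
            mates.setdefault(p, set()).update(ps - {p})
    return all(len(m) <= 1 for m in mates.values())
```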

A pedigree $V = (P, s, X, \ell)$ is generational if there exists a function $G : I(P) \to \mathbb{N}$ such that
the following conditions hold:

1. $G(v) = 1$ for some $v \in I(P)$ where v has in-degree zero, and

2. If $(u, v) \in E(P)$, then $G(v) = G(u) + 1$.

The number G(v) is called the generation of v, and G is called the generation map of V. 
Whenever V is a generational pedigree with pedigree graph P, we will use $I_g(P)$ to denote the
individuals of V whose generation is g. We will say that a pedigree is connected if its pedigree 
graph is weakly connected. It is easy to see that a connected, generational pedigree has a unique 
generation map. The maximal value of this map is the number of generations of the pedigree. 
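For a connected pedigree, the unique generation map (if one exists) can be computed by propagating relative generations along edges and then shifting so that the smallest value is 1. The following sketch makes that concrete; the function name `generation_map` and the edge-set representation are assumptions of this example, and only the connected case is handled.

```python
from collections import deque

def generation_map(individuals, edges):
    """Return the generation map of a connected pedigree as a dict, or None
    if the pedigree is not generational (illustrative sketch).

    individuals: iterable of hashable ids
    edges: set of (parent, child) pairs
    """
    nbrs = {v: [] for v in individuals}
    for p, c in edges:
        nbrs[p].append((c, +1))   # a child is one generation later
        nbrs[c].append((p, -1))   # a parent is one generation earlier
    if not nbrs:
        return {}

    start = next(iter(nbrs))
    rel = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w, step in nbrs[v]:
            if w not in rel:
                rel[w] = rel[v] + step
                queue.append(w)
            elif rel[w] != rel[v] + step:
                return None       # two paths disagree: not generational
    if len(rel) != len(nbrs):
        return None               # not weakly connected; outside this sketch

    # Shift so that the earliest generation is numbered 1. The earliest
    # individuals are necessarily founders, as condition 1 requires.
    shift = 1 - min(rel.values())
    return {v: g + shift for v, g in rel.items()}
```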

2.1 Pedigree Isomorphism 

We now define the notion of an isomorphism between pedigrees. To do so, we first present the more 
general idea of a matching between pedigrees. 

Definition 2.1. Given two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$, and a set $Y \subseteq I(P)$,
an injective map $M : Y \to I(P')$ is a pedigree matching between V and V' if it satisfies the
following conditions.

1. For every $v \in Y$, $s(v) = s'(M(v))$.

2. For all $n \in \ell(X) \cap \ell'(X')$, Y contains $\ell^{-1}(n)$ and $M(\ell^{-1}(n)) = (\ell')^{-1}(n)$.

Informally, the second condition states that M should respect the labellings $\ell$ and $\ell'$ in the sense
that if $\ell$ and $\ell'$ give the same label to two vertices u and v respectively, then M should match u up
to v.

We can now characterize a pedigree isomorphism as a matching that is also a graph isomorphism 
between the two pedigree graphs. 

Definition 2.2. Given two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$, a bijection
$\phi : I(P) \to I(P')$ is a pedigree isomorphism between V and V' if:

1. $\phi$ is a pedigree matching between V and V', and

2. $(u, v) \in E(P)$ if and only if $(\phi(u), \phi(v)) \in E(P')$.

The Pedigree Isomorphism Problem. Given two pedigrees V and V', does there exist a 
pedigree isomorphism between them? 

The Compatibly Leaf-Labeled Pedigree Isomorphism Problem. Given two compatibly
leaf-labeled pedigrees V and V', does there exist a pedigree isomorphism between them?

The Sub-Pedigree Isomorphism Problem. Given two pedigrees V and V', does there exist
a pedigree isomorphism between V and some sub-pedigree of V'?

In this paper, we will show that the compatibly leaf-labeled pedigree isomorphism problem can 
be solved in linear time. We will also show that the pedigree isomorphism problem is as hard as the 
general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. 

2.2 Edit Distance 

We are interested not only in determining whether two pedigrees are isomorphic, but also in how 
close they are to being isomorphic. Such a measure of distance between pedigrees would be useful 
for evaluating pedigree reconstruction methods: we could take a known pedigree, extract a subset 
of its individuals, reconstruct a pedigree from those individuals using our method of choice, and 
check the distance between the reconstructed pedigree and the true pedigree. 

Informally, given two arbitrary pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$, we want to
find the minimum number of edge additions/deletions required to convert V into V'. We call this
the edit distance between V and V'. Notice that it is not necessary that $|I(P)| = |I(P')|$, because
addition/deletion of edgeless vertices will be free.

Formally, we can define edit distance in terms of matchings between pedigrees. To do this, we 
need to measure how close a matching is to being a pedigree isomorphism. This is done by looking 
at the set of edges that are well matched by the matching. 

Definition 2.3. Given two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$ and a matching
$M : Y \to I(P')$ between them, the set $W_M$ of edges well-matched by M is defined by

$$W_M = \{(u, v) \in E(P) : u, v \in Y,\ (M(u), M(v)) \in E(P')\}.$$

Notice that the subgraph of P induced by the edges in $W_M$ is isomorphic to the subgraph of
P' induced by the edges in $M(W_M)$. Therefore, M implicitly defines an edit path from P to P':
delete all the edges in $E(P) - W_M$, then add all the edges in $E(P') - M(W_M)$. Moreover, the
shortest edit path from P to P' must consist only of removing edges from P and adding edges from
P', and so can be obtained in this way from a matching (up to the order in which edges are added
and removed, which does not affect the length of the edit path). With this in mind, we define the
match distance incurred by M to be the length of the corresponding edit path:
Definition 2.4. Given two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$ and a matching
$M : Y \to I(P')$ between them, the match distance of M is

$$d(M) = d_P(M) + d_{P'}(M)$$

where $d_P(M) = |E(P) - W_M|$ and $d_{P'}(M) = |E(P') - M(W_M)|$.

Now the edit distance is the length of the shortest edit path; in other words, the smallest 
possible match distance between the two pedigrees. 
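The match distance of a given matching is straightforward to compute from the two edge sets. The sketch below assumes that `matching` is an injective dictionary that already satisfies the gender and label conditions of a pedigree matching; the name `match_distance` is illustrative.

```python
def match_distance(edges_p, edges_q, matching):
    """Compute d(M) = |E(P) - W_M| + |E(P') - M(W_M)| (illustrative sketch).

    edges_p, edges_q: sets of (parent, child) pairs for P and P'
    matching: dict from individuals of P to individuals of P', assumed injective
    """
    # W_M: edges of P whose endpoints are both matched and whose image is an edge of P'.
    well_matched = {
        (u, v) for (u, v) in edges_p
        if u in matching and v in matching
        and (matching[u], matching[v]) in edges_q
    }
    d_p = len(edges_p) - len(well_matched)
    # |M(W_M)| equals |W_M| because M is injective.
    d_q = len(edges_q) - len(well_matched)
    return d_p + d_q
```

With an empty matching every edge of both graphs is deleted or added, so the distance is $|E(P)| + |E(P')|$; a pedigree isomorphism gives distance 0.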

Definition 2.5. Given two pedigrees $V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$, the edit distance
between V and V' is

$$D_{V,V'} = \min_{M} d(M).$$

The edit distance between two pedigrees can be calculated by finding a matching M with a 
maximum number of well-matched edges. The following proposition, which we state without proof, 
formalizes this. 

Proposition 2.1. Given two pedigrees V and V', a matching M between V and V' for which $|W_M|$
is maximized satisfies $D_{V,V'} = d(M)$.

The Pedigree Edit Distance Problem. For pedigrees V and V', find a matching M between
V and V' such that $d(M) = D_{V,V'}$.

Notice that a pedigree isomorphism, if it exists, has a match distance of 0. In this paper, we 
will give efficient algorithms for a few different restrictions of the pedigree edit distance problem. 
The four main problems and their hardness, as established in this paper, are shown in the table. 





                           Isomorphism    Edit Distance
Compatibly Leaf-Labeled    Linear alg.    NP-hard
Not Labeled                GI-hard        APX-hard



3 A Linear-Time Algorithm for the Compatibly Leaf-Labeled 
Pedigree Isomorphism Problem 

In this section, we introduce a linear-time algorithm for the compatibly leaf-labeled pedigree iso- 
morphism problem. In particular, we will establish that if all leaves are labeled then there exists 
a total order on the individuals that is easy to calculate. The total orders, calculated on both 
pedigrees, are such that if an isomorphism exists between two compatibly leaf-labeled pedigrees, it 
can be easily found from these total orders. 

Proposition 3.1. There exists a linear-time algorithm for the compatibly leaf-labeled pedigree iso- 
morphism problem. 

Proof. We define gender topological sort as follows. Recall that the traditional topological sort
algorithm does a depth-first search (DFS), and when each node is finished being visited, it gets
pushed into the ordered list. Gender topological sort consists of a DFS of the ancestor tree of each
leaf, where the female parent is always visited before the male parent (i.e., the DFS traverses each
directed edge in the opposite direction, from child to parent). From a single leaf, this rule determines
the order in which the ancestral nodes of the leaf are visited. Now, we simply use the leaf labeling
to sort the leaves, and we begin our DFS from each leaf in sorted order. This algorithm finds a total
order on the nodes of the pedigree that is fully determined by the topology, gender, and leaf labels
of the pedigree.

Let the binary relation < be the total order found by the above gender topological sort. Let
$n = |I(P)| = |I(P')|$. (If $|I(P)| \neq |I(P')|$ then there is no isomorphism.) The nodes of pedigree
V are ordered via gender topological sort so that $p_1 < p_2 < \cdots < p_n$ where $p_i \in I(P)$. Similarly,
the nodes of pedigree V' are ordered so that $p'_1 < p'_2 < \cdots < p'_n$ where $p'_i \in I(P')$. Because
isomorphism preserves topology, gender, and labels, it must also preserve this total order. Thus, if
an isomorphism exists between V and V', it must be $\phi$ defined as $\phi(p_i) = p'_i$. So to check whether
V and V' are isomorphic, it suffices to compute the gender topological sort for each pedigree, and
then check whether $\phi$ is an isomorphism.

The running time of this isomorphism algorithm is linear, since the gender topological sort is
linear, and after obtaining $\phi$, checking that the genders, edges, and labels are preserved also requires
linear time. □
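The proof can be turned into a short program. The sketch below is one possible reading of gender topological sort, assuming each pedigree is a triple `(edges, sex, labels)` with `edges` a set of `(parent, child)` pairs and `labels` defined exactly on the leaves; these names are hypothetical. It uses Python's comparison sort on the leaf labels, so as written it runs in O(n log n) time; a linear bound would need a linear-time sort of the integer labels.

```python
def gender_topological_order(edges, sex, labels):
    """Post-order DFS from each leaf (in label order), walking child -> parent
    and visiting the mother before the father (illustrative sketch)."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, []).append(p)

    order, visited = [], set()
    for leaf in sorted(labels, key=labels.get):
        stack = [(leaf, False)]
        while stack:
            v, finished = stack.pop()
            if finished:
                order.append(v)          # all ancestors of v are already listed
                continue
            if v in visited:
                continue
            visited.add(v)
            stack.append((v, True))
            # Push the father first so that the mother is popped (visited) first.
            for p in sorted(parents.get(v, []), key=lambda x: sex[x] == "f"):
                stack.append((p, False))
    return order

def leaf_labeled_isomorphic(ped1, ped2):
    """Decide isomorphism of two compatibly leaf-labeled pedigrees."""
    (e1, s1, l1), (e2, s2, l2) = ped1, ped2
    o1 = gender_topological_order(e1, s1, l1)
    o2 = gender_topological_order(e2, s2, l2)
    if len(o1) != len(o2) or len(e1) != len(e2):
        return False
    phi = dict(zip(o1, o2))              # the only candidate isomorphism
    return (all(s1[v] == s2[phi[v]] for v in o1)
            and all(l1.get(v) == l2.get(phi[v]) for v in o1)
            and all((phi[u], phi[v]) in e2 for u, v in e1))
```

Because `phi` is a bijection and the edge counts agree, checking that every edge of the first pedigree maps to an edge of the second is enough to establish the "if and only if" condition of Definition 2.2.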

Notice that a small modification of this algorithm can find leaf-labeled subgraph isomorphisms. 

4 Algorithms for Computing the Edit Distance 

We show later in this paper that, even for monogamous pedigrees, there is no polynomial-time 
approximation scheme for the edit distance problem unless P = NP. In this section, we show 
that if we restrict the scope of the problem, efficient algorithms are possible. Specifically, we will 
give exact, efficient algorithms for the following two restricted cases of the pedigree edit distance 
problem. In this section we assume that the pedigrees are connected, but this condition can be 
easily removed. 

1. The case in which the two pedigrees are generational, compatibly leaf-labeled, and both have
two generations. 

2. The case in which the two pedigrees are generational, compatibly leaf-labeled, and "sufficiently 
similar". By sufficiently similar we mean that there exists an optimal matching between 
the pedigrees that preserves generations and that for any two consecutive generations i and 
i + 1, the distance between the two sub-pedigrees obtained by restricting both pedigrees to 
generations i and i + 1 is small. Our algorithm for this case has the advantage that its 
run-time improves as the pedigrees become more similar. 

We then give a randomized heuristic that appears to work well in the general case and is based 
on an alternate characterization of pedigrees in terms of lists of descendants rather than parent-child 
relationships, as well as a faster heuristic for the second case listed above. 

In the rest of this section, we will denote the two pedigrees under consideration by
$V = (P, s, X, \ell)$ and $V' = (P', s', X', \ell')$.

4.1 Exact Algorithm for Two-Generation, Compatibly Leaf-Labeled Pedigrees 

When V and V' are connected, generational, compatibly leaf-labeled, and have two generations
each, the edit distance between them can be calculated in polynomial time. We find this distance
by constructing two maximum-weight bipartite matching instances, one for each gender, whose
solutions give us an optimal matching between V and V'. In doing this, we are maximizing the
number of well-matched edges, which is equivalent to minimizing the distance.

Recall that $I_g(P)$ is the set of individuals in the gth generation of P. Our assumption that both
pedigrees are compatibly leaf-labeled determines the matching M on $I_2(P)$: map each $v \in I_2(P)$
to $(\ell')^{-1}(\ell(v)) \in I_2(P')$. In addition, we may assume without loss of generality that no individuals
in the oldest generation of P and P' are labeled. This is because the labels that $I_1(P)$ and $I_1(P')$
share determine the matching on those vertices, and the labels not shared are irrelevant to the edit
distance.

It remains to extend M optimally to $I_1(P)$. To do this, we first define $I_1^f(P)$ and $I_1^m(P)$ to
be the females and males of $I_1(P)$ respectively, and define $I_1^f(P')$ and $I_1^m(P')$ analogously. Now,
for each gender s, we construct a complete bipartite graph $G_s$ with left vertices $I_1^s(P)$ and right
vertices $I_1^s(P')$ where the weight assigned to an edge (u, v) is the number of children of u who are
matched by M to children of v. Together with Proposition 2.1, the following proposition establishes
that solving the maximum-weight bipartite matching instances $G_f$ and $G_m$ will yield an optimal
matching of V and V'.

Proposition 4.1. For $s \in \{m, f\}$, let $M_s$ be a perfect matching in $G_s$, and extend the matching
M to all of I(P) as follows: for $v \in I_1(P)$, define M(v) to be the vertex
matched to v by $M_{s(v)}$. Then the number of edges well-matched by the extended M is the sum of
the weights of $M_f$ and $M_m$.
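The two-generation procedure can be sketched as follows. To keep the example self-contained, the maximum-weight bipartite matching is found by brute force over permutations, which is exponential and serves only as a stand-in for a polynomial-time solver such as the Hungarian algorithm. The triple representation `(edges, sex, labels)` and all function names are assumptions of this sketch, and unequal numbers of founders are handled by matching the smaller side into the larger.

```python
from itertools import permutations

def best_assignment_weight(left, right, weight):
    """Brute-force maximum-weight assignment (stand-in for a polynomial solver)."""
    if len(left) > len(right):
        left, right, weight = right, left, (lambda a, b, w=weight: w(b, a))
    return max(sum(weight(u, v) for u, v in zip(left, perm))
               for perm in permutations(right, len(left)))

def two_generation_edit_distance(ped1, ped2):
    """Edit distance of two-generation, compatibly leaf-labeled pedigrees."""
    (e1, s1, l1), (e2, s2, l2) = ped1, ped2

    # The leaf labels fix the matching on the second generation.
    by_label2 = {lab: v for v, lab in l2.items()}
    leaf_match = {v: by_label2[lab] for v, lab in l1.items()}

    def children_of(edges):
        ch = {}
        for p, c in edges:
            ch.setdefault(p, set()).add(c)
        return ch
    ch1, ch2 = children_of(e1), children_of(e2)

    def weight(u, v):
        # Number of children of u that are matched to children of v.
        return sum(1 for c in ch1[u] if leaf_match[c] in ch2[v])

    well_matched = 0
    for g in ("f", "m"):                       # one instance per gender
        left = sorted(u for u in ch1 if s1[u] == g)
        right = sorted(v for v in ch2 if s2[v] == g)
        well_matched += best_assignment_weight(left, right, weight)

    # d(M) = |E(P) - W_M| + |E(P') - M(W_M)| with |M(W_M)| = |W_M|.
    return len(e1) + len(e2) - 2 * well_matched
```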

4.2 Exact Algorithm for Sufficiently Similar, Generational, Compatibly Leaf- 
Labeled Pedigrees 

Suppose that there exists an optimal matching M between V and V' that is generation preserving 
(i.e., such that the generation of v equals the generation of M(v) for all individuals v matched by 
M). We will also assume for simplicity that V and V' are each made up of g generations of m 
males and m females each, though this assumption can be easily removed. If the pedigree graphs P 
and P' are similar enough, we can use dynamic programming to find an optimal matching between 
them in time exponential only in the edit distance. Thus, we have a polynomial-time algorithm if, 
for any two consecutive generations i and i + 1, the edit distance between the two sub-pedigrees
obtained by restricting both pedigrees to generations i and i + 1 is at most $O(\log(n)/g)$, where n
is the number of pedigree individuals. 

To describe our algorithm, we first introduce some notation. Given some $S \subseteq \{1, 2, \ldots, g\}$, we
let $V|_S$ denote the minimal sub-pedigree of V that contains $I_i(P)$ for every $i \in S$, and we define
$V'|_S$ analogously. We can then write $\mathcal{M}(S)$ to denote the set of all generation-preserving matchings
from $V|_S$ to $V'|_S$; note that

$$\mathcal{M}(S) = \times_{i \in S}\, \mathcal{M}(\{i\})$$

where $\mathcal{M}(\{i\})$ is the set of matchings of generation i. Given a generation-preserving matching M,
let $d_S(M)$ be the match distance incurred by M restricted to be a matching between $V|_S$ and $V'|_S$,
and let $B_i(M)$ be the edit distance between $V|_{\{i,\ldots,g\}}$ and $V'|_{\{i,\ldots,g\}}$ taken only over matchings that
agree with M wherever M is defined.

As in the previous section, we assume without loss of generality that only the leaf individuals 
of either pedigree are labeled. 

The algorithm we present rests on the following relation. 
Lemma 4.1. For every $i \in \{1, \ldots, g-1\}$, and for every $M \in \mathcal{M}(\{i\})$, we have

$$B_i(M) = \min_{M' \in \mathcal{M}(\{i+1\})} \left[ B_{i+1}(M') + d_{\{i,i+1\}}(M \cup M') \right] \qquad (1)$$

where $M \cup M' \in \mathcal{M}(\{i, i+1\})$ denotes the matching that equals M on $V|_{\{i\}}$ and M' on $V|_{\{i+1\}}$.

Lemma 4.1 gives rise to a simple dynamic programming algorithm: start with the matching of
the gth generation determined by the labellings of the leaves, then iteratively work up the pedigree,
using the lemma above to find, for every i, the values of $B_i(M)$ for every $M \in \mathcal{M}(\{i\})$. The edit
distance is then given by

$$\min_{M \in \mathcal{M}(\{1\})} B_1(M).$$

However, the problem with this straightforward algorithm is that because it needs to consider
every possible matching $M \in \mathcal{M}(\{i\})$, its run-time is factorial in m, the number of males/females
in each generation. At each generation, there are $(m!)^2$ possible matchings M to process, and
performing the minimization for each matching takes time $O((m!)^2)$. Therefore, the run-time of
this algorithm is $O(g\,(m!)^4)$.

We can improve the algorithm if we know that the two pedigrees under consideration are
sufficiently similar at each generation, so that there is no need to consider all matchings for each
generation. Suppose we are promised that an optimal matching M between V and V' has
$d_{\{i,i+1\}}(M) \le k$ for every $1 \le i < g$. Then in the algorithm above we would only need to process,
for each i and each $M \in \mathcal{M}(\{i\})$, the matchings $M' \in \mathcal{M}(\{i+1\})$ such that
$d_{\{i,i+1\}}(M \cup M') \le k$. This enumeration can be done, for a fixed $M' \in \mathcal{M}(\{i+1\})$ of the previous
generation, by recursively matching vertices to gradually define $M \in \mathcal{M}(\{i\})$, all the time avoiding
any assignment that would violate the condition $d_{\{i,i+1\}}(M \cup M') \le k$. The case of a small edit
distance is particularly interesting.
Our simulations (see Fig. 3) show that when some number x of random changes are made to a
pedigree, the edit distance is close to x when x is small but for larger values of x, some changes 
cancel each other out and the edit distance grows more slowly than x. Thus, when the edit distance 
is small, it corresponds more closely to the actual number of changes made. 

How many matchings are considered by this method? The following two lemmas establish an
upper bound of $O(m^{2k})$ on this quantity.

Lemma 4.2. For every fixed $M' \in \mathcal{M}(\{i+1\})$, the number of matchings $M \in \mathcal{M}(\{i\})$ such that
$d_{\{i,i+1\}}(M \cup M') \le k$ is at most $T(m, k)^2$, where T satisfies the recurrence relation

$$T(n, c) = T(n-1, c) + (n-1)\,T(n-1, c-2)$$

with initial conditions $T(1, \cdot) = 1$ and $T(n, 0) = T(n, 1) = 1$.

Proof. First, suppose that there are only m individuals to match (i.e. that there is only one gender). 
Initially, there are m individuals to match and k "cost points" that can be used in doing so. In the 
best case, given some $u \in V|_{\{i\}}$, there is at most one choice for M(u) that does not increase the
cost of the matching being built. This follows from the fact that u has at least one child (otherwise 
u is a labeled leaf and so M(u) is already determined), who is already matched somewhere by M' . 
Besides this option for M(u), there are at most m — 1 other options, each of which will increase the 
cost of the matching by at least 2 (since at least one edge will have to be deleted and one edge will 
have to be added). This establishes the recurrence. The initial conditions follow from the following 
facts: 






1. When one individual is left to be matched, there is only one possible way to complete the 
matching being built. 



2. When the matching being built already has distance k (i.e., there are no cost points left), there
is at best only one way to complete the matching.

This bounds the number of matchings of each gender by $T(m, k)$. Since this process occurs
independently for each gender, the number of total matchings is at most $T(m, k)^2$. □



Lemma 4.3. The recurrence T in Lemma 4.2 satisfies $T(n, c) = O(n^c)$.



Proof. We proceed by induction on c. The initial conditions of T give us our base cases of c = 0, 1.
The general case follows from bounding the difference between successive values of $T(\cdot, c)$: the
recurrence gives us that $T(n, c) - T(n-1, c) = (n-1)\,T(n-1, c-2)$, which is $n \cdot O(n^{c-2}) = O(n^{c-1})$
by the inductive hypothesis. Summing these differences over the first argument gives $T(n, c) = O(n^c)$. □
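The recurrence is easy to evaluate numerically, which gives a quick sanity check of the bound. The sketch below memoizes T exactly as stated in Lemma 4.2; the function name `T` mirrors the paper's notation, and the explicit comparison against $n^c$ is an illustration added here rather than a result from the paper.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n, c):
    """Recurrence of Lemma 4.2: T(1, .) = 1, T(n, 0) = T(n, 1) = 1, and
    T(n, c) = T(n-1, c) + (n-1) * T(n-1, c-2) otherwise."""
    if n == 1 or c <= 1:
        return 1
    return T(n - 1, c) + (n - 1) * T(n - 1, c - 2)

# Compare T(n, c) with n**c on a small grid of arguments.
for n in range(1, 12):
    for c in range(0, 8):
        assert T(n, c) <= n ** c
```

For c = 2 the recurrence unrolls to $T(n, 2) = 1 + n(n-1)/2$, which grows quadratically as the lemma predicts.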

Thus, the number of matchings to be considered at generation i is $m^{2k}$ times the number of
matchings to be considered at generation i + 1. So the run-time of this algorithm is dominated by
its last step, in which matchings between the oldest generations of V and V' are considered and the
best one is chosen; the number of these matchings is at most $O(m^{2k(g-1)}) = O(m^{2d})$ where d is the
maximum possible distance between the two pedigrees. Thus, if k (the distance between pairs of 
successive generations) and g (the number of generations) are small, the algorithm can efficiently 
calculate edit distance. 

This algorithm always finds the correct edit distance, when the upper bound k is known. When 
k is not known, the algorithm can be adapted to a heuristic by guessing a reasonable k, and if there 
is a step with no matching within distance k, the algorithm is aborted and the randomized heuristic 
(described below) is used instead. In Section 6, we show by comparison to a branch-and-bound
algorithm that tries all possible matchings that this heuristic adaptation often finds the correct 
edit distance when used on randomly generated pedigrees. 



4.3 Heuristic Improvement of Dynamic Programming Algorithm 

We can turn the dynamic programming algorithm from Section 4.2 into a faster heuristic by
enumerating a still smaller set of matchings. For each generation and for a fixed labeling of the
previous generation, we can create an instance of the maximum-weight bipartite matching
problem. However, instead of solving the problem exactly, we can enumerate its $\gamma$ best solutions and
consider those matchings only. Since this can be done in time $O(\gamma m^3)$ (see [8]), this improves the
running time of the algorithm to $O(m^3 \gamma^{g-1})$.



4.4 Randomized Heuristic for Compatibly Leaf-Labeled Pedigrees 

The randomized matching algorithm for regular pedigrees rests on the idea of viewing pedigrees 
not as lists of parent-child relationships, but rather as sets of so-called descendant splits. The 
descendant split of an individual $u \in I(P)$ is the set of individuals $v \in I(P)$ such that there is a
directed path from u to v. When a pedigree is fully labeled, its full set of descendant splits uniquely
specifies it. For more on descendant splits see [18].

The heuristic calculates the match distance incurred by a matching that is randomly selected
as follows: at each generation, among individuals of the same gender, an individual $u \in I(P)$ is
matched to $v \in I(P')$ with probability proportional to the number of individuals in the descendant
splits of the two individuals which are identically labeled. This can be done in polynomial time by
creating, for each generation, an $m \times m$ matrix of individual match probabilities for each gender
(where there are 2m individuals per generation). The matches are then drawn from these probability
matrices (without replacement of previously matched individuals). This can be done multiple times
to increase the chances of finding a 'good' matching.
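One round of this sampling, for a single generation and gender, might look as follows. This is a sketch rather than the implementation used in the simulations: the helper names `descendant_labels` and `sample_matching` are hypothetical, edges are `(parent, child)` pairs, and the uniform fallback when all weights are zero is an assumption added here because the text does not specify that case.

```python
import random

def descendant_labels(edges, labels):
    """Map each individual to the set of labels carried by its descendants."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    memo = {}

    def visit(v):
        if v not in memo:
            found = set()
            for c in children.get(v, []):
                if c in labels:
                    found.add(labels[c])
                found |= visit(c)
            memo[v] = found
        return memo[v]

    for v in list(children):
        visit(v)
    return memo

def sample_matching(left, right, desc1, desc2, rng):
    """Match each u in left to some v in right, drawn without replacement with
    probability proportional to the number of shared descendant labels."""
    available = list(right)
    matching = {}
    for u in left:
        if not available:
            break
        weights = [len(desc1.get(u, set()) & desc2.get(v, set())) for v in available]
        if sum(weights) == 0:
            weights = [1] * len(available)   # assumption: fall back to uniform
        v = rng.choices(available, weights=weights, k=1)[0]
        matching[u] = v
        available.remove(v)
    return matching
```

Repeating the draw with different seeds and keeping the matching with the smallest match distance corresponds to running the heuristic multiple times.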

This algorithm is difficult to analyze because different leaves do not always have disjoint paths to 
the vertex being matched. However, we show in simulations (Section 6) that it performs reasonably
well relative to a branch-and-bound algorithm that considers all possible matchings. 

5 Hardness Results 

Having presented algorithms for various restrictions of the pedigree isomorphism and edit distance
problems, we now establish the difficulty of solving the general versions of these problems.
Specifically, we give hardness results for the pedigree isomorphism problem (GI-Hard), the sub-pedigree
isomorphism problem (NP-Hard), the pedigree edit distance problem (APX-Hard), and the
compatibly leaf-labeled pedigree edit distance problem (NP-Hard). The first few proofs require that
$X = \emptyset$, while the final proof considers compatibly leaf-labeled pedigrees.

5.1 The Hardness of Pedigree Isomorphism and Sub-Pedigree Isomorphism 
5.1.1 Pedigree Isomorphism is GI-Hard 

We begin by showing that the pedigree isomorphism problem is as hard as the general graph
isomorphism problem. Graph isomorphism is one of very few problems in NP that is neither known
to be in P nor known to be NP-complete. Problems that are at least as hard as graph isomorphism
are known as GI-Hard. To show our hardness result, we reduce from bipartite graph isomorphism,
which is known to be GI-Hard.



Reduction: Given a bipartite graph $G = (V_1 \cup V_2, E)$, we define a pedigree $\mathcal{P}(G) = (P, s)$.
Intuitively, each vertex $u \in V_1$ is replaced with two vertices, a male denoted $u^m$ and a female
denoted $u^f$, and for each edge $(u, v)$ in $E$, the couple corresponding to $u^m$ and $u^f$ have two children,
a male denoted $(u, v)^m$ and a female denoted $(u, v)^f$. This gives as many couples corresponding to
$v$ as there are edges going into $v$. To encode the fact that all of these couples correspond to edges
with the same vertex $v \in V_2$, we have every member of the form $(u, v)^m$ for some $u \in V_1$ mate
with every member of the form $(u', v)^f$ for some $u' \in V_1$ to obtain a female child which we denote
$(u, u', v)^f$. Formally, this gives us the following pedigree graph:

• $I(P) = I_1 \cup I_2 \cup I_3$, where

- $I_1 = \{u^m, u^f : u \in V_1\}$

- $I_2 = \{(u, v)^m, (u, v)^f : (u, v) \in E\}$

- $I_3 = \{(u, u', v)^f : (u, v), (u', v) \in E\}$

• $E(P) = E_1 \cup E_2$, where

- $E_1 = \{(u^s, (u, v)^{s'}) : (u, v) \in E,\ s, s' \in \{m, f\}\}$

- $E_2 = \{((u, v)^m, (u, u', v)^f),\ ((u, v)^f, (u', u, v)^f) : (u, v), (u', v) \in E\}$

By construction, we have:

Lemma 5.1. P is a pedigree graph.

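The construction can be sketched in Python as follows. This is only an illustration under stated assumptions: individuals are encoded as tagged tuples (the tags "I1", "I2", "I3" and both function names are hypothetical), and the check at the end covers just one of the pedigree conditions, namely that every individual has either no parents or exactly one father and one mother.

```python
def pedigree_from_bipartite(v1, edges):
    """Build genders and parent->child edges for a bipartite graph.

    v1: the vertices of V_1; edges: pairs (u, v) with u in V_1 and v in V_2.
    """
    edges = sorted(set(edges))
    gender = {}
    parent_child = set()
    # I_1: a male and a female founder for every u in V_1.
    for u in v1:
        for s in "mf":
            gender[("I1", u, s)] = s
    # I_2: a son and a daughter of the couple (u^m, u^f) for every edge (u, v).
    for (u, v) in edges:
        for s_child in "mf":
            child = ("I2", u, v, s_child)
            gender[child] = s_child
            for s_parent in "mf":
                parent_child.add((("I1", u, s_parent), child))
    # I_3: a daughter of (u, v)^m and (u', v)^f for every pair of edges into v.
    for (u, v) in edges:
        for (u2, v2) in edges:
            if v2 == v:
                child = ("I3", u, u2, v, "f")
                gender[child] = "f"
                parent_child.add((("I2", u, v, "m"), child))
                parent_child.add((("I2", u2, v, "f"), child))
    return gender, parent_child


def has_one_father_and_one_mother(gender, parent_child):
    """Check that each individual has zero parents or one of each gender."""
    parents = {person: [] for person in gender}
    for parent, child in parent_child:
        parents[child].append(gender[parent])
    return all(sorted(p) in ([], ["f", "m"]) for p in parents.values())


# Hypothetical bipartite graph with V_1 = {a, b} and V_2 = {x, y}.
gender, parent_child = pedigree_from_bipartite(
    ["a", "b"], [("a", "x"), ("b", "x"), ("a", "y")]
)
```

For this example the sketch produces 4 individuals in $I_1$, 6 in $I_2$, and 5 in $I_3$.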
Next, we show: 

Proposition 5.1. Two bipartite graphs $G = (V_1 \cup V_2, E)$ and $G' = (V_1' \cup V_2', E')$ with no isolated
vertices are isomorphic if and only if the two pedigrees $\mathcal{P}(G) = (P, s)$ and $\mathcal{P}(G') = (P', s')$ are
isomorphic.

Proof. ($\Rightarrow$:) Suppose we have a graph isomorphism $\psi : V_1 \cup V_2 \to V_1' \cup V_2'$. It is easy to verify that
the following definitions give a map $\phi : I(P) \to I(P')$ that is a pedigree isomorphism.

• $\phi(u^s) = \psi(u)^s$ for $u^s \in I_1$

• $\phi((u, v)^s) = (\psi(u), \psi(v))^s$ for $(u, v)^s \in I_2$

• $\phi((u, u', v)^f) = (\psi(u), \psi(u'), \psi(v))^f$ for $(u, u', v)^f \in I_3$

($\Leftarrow$:) We write $I(P) = I_1 \cup I_2 \cup I_3$ and write $I(P') = I_1' \cup I_2' \cup I_3'$. Using this notation, our
assumption gives us an injective map $\phi : I_1 \cup I_2 \cup I_3 \to I_1' \cup I_2' \cup I_3'$ that is a graph isomorphism
between $P$ and $P'$.

We observe that $\phi$ must map $I_j$ to $I_j'$ because $\phi$ preserves sources and sinks. We also note that $\phi$
must preserve familial relationships. We can therefore define the graph isomorphism
$\psi : V_1 \cup V_2 \to V_1' \cup V_2'$ as follows.

• For $u \in V_1$, set $\psi(u) = u'$, where $u' \in V_1'$ is such that $\phi(u^s) = u'^s$ for $s \in \{m, f\}$.

• For $v \in V_2$, let $U \subseteq V_1$ be the neighbors of $v$. Because $U$ is non-empty, there exists by
construction a set of vertices in $I_2$ corresponding to $v$, all of whom mate with each other and
no one else. Because it is a pedigree isomorphism, $\phi$ sends this set to a set in $I_2'$ all of whom
mate with each other and no one else, and which thus corresponds to a vertex $v' \in V_2'$. We
set $\psi(v) = v'$.

We now show that if $(u, v)$ is an edge in $G$, then $(\psi(u), \psi(v))$ is an edge in $G'$, making $\psi$ a
graph isomorphism. Suppose $(u, v)$ is an edge in $G$. Then the vertex $(u, v)^m$ exists in $I(P)$, and
there is an edge from $u^m$ to $(u, v)^m$ in $P$. Then in $P'$, there is an edge from $\phi(u^m) = \psi(u)^m$ to
$\phi((u, v)^m) = (\psi(u), \psi(v))^m$. But then there must be an edge from $\psi(u)$ to $\psi(v)$ in $G'$. □

The case of bipartite graphs with isolated vertices is easy to handle when checking for bipartite 
graph isomorphism: we ensure that there are the same number of isolated vertices in either graph, 
remove them, and then check for isomorphism. Therefore, Proposition 5.1, together with the fact 
that bipartite graph isomorphism is at least as hard as general graph isomorphism, gives us the 
following theorem. 

Theorem 5.1. The pedigree isomorphism problem is GI-hard.

5.1.2 Sub-Pedigree Isomorphism is NP-Hard

The reduction given in the previous section is easily adapted to show that sub-pedigree isomorphism 
is as hard as bipartite sub-graph isomorphism. Since bipartite sub-graph isomorphism is NP-hard 
by a trivial reduction from bipartite Hamiltonian cycle [2], this gives us the following result. 

Theorem 5.2. The sub-pedigree isomorphism problem is NP-Hard. 

Notice that Theorem 5.2 already implies the following corollary about the hardness of the
pedigree edit distance problem.

Corollary 5.1. The pedigree edit distance problem is NP-hard. 

Proof. By reduction from sub-pedigree isomorphism: $\mathcal{P} = (P, s)$ is a sub-pedigree of $\mathcal{P}' = (P', s')$
if and only if the edit distance between them is exactly $|E(P)| - |E(P')|$. □

In the next section, we will improve this result by showing that the pedigree edit distance 
problem is hard even to approximate in polynomial time. 

5.2 Pedigree Edit Distance is APX-Hard 

Our results about the hardness of sub-pedigree isomorphism implied that pedigree edit distance is 
NP-hard. We now strengthen this result, showing that pedigree edit distance is APX-hard. 

Remark. There is extensive literature on the hardness/tractability of the more general problem of
inexact graph matching fTb\ However, these hardness results do not apply to our edit distance- 
perhaps the hard cases of inexact graph matching are non-pedigrees. 

Our reduction will actually establish the hardness of the following problem: 

The Minimum Cut/Paste Distance between Trees (MCPDT) Problem: Given two directed
rooted unlabeled trees $T_1, T_2$, and a natural number $k$, can $T_1$ be converted into $T_2$ using $k$
edge additions/deletions?

Showing that MCPDT is hard suffices to establish the hardness of the pedigree edit-distance 
problem because an arbitrary unlabeled tree can be trivially turned into an unlabeled monogamous 
pedigree: consider all nodes of the tree to be female and add a founding male mate to each non-leaf 
in the tree. This transformation doubles the cut/paste distance between trees because it exactly 
doubles the number of edges in each tree. 
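The transformation from a tree to a monogamous pedigree can be sketched as below. The dictionary encoding of the tree, the function name, and the `("mate", node)` naming of the added founders are assumptions made for illustration only.

```python
def tree_to_pedigree(children):
    """Turn a rooted tree into a monogamous pedigree.

    children: dict mapping every tree node to the list of its children.
    Every tree node becomes a female; every non-leaf receives a new founding
    male mate who is the father of all of her children.
    """
    gender = {}
    edges = set()
    for node, kids in children.items():
        gender[node] = "f"
        for kid in kids:
            gender[kid] = "f"
    for node, kids in children.items():
        if kids:
            mate = ("mate", node)
            gender[mate] = "m"
            for kid in kids:
                edges.add((node, kid))   # original tree edge (mother -> child)
                edges.add((mate, kid))   # added edge (father -> child)
    return gender, edges


# Hypothetical tree: r has children a and b; a has the child c.
tree = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
gender, edges = tree_to_pedigree(tree)
```

The example tree has 3 edges and the resulting pedigree has 6, which matches the doubling noted in the text.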

Remark. Notice that the cut/paste distance between trees is different from the subtree-prune-and-regraft
(rSPR) operation for binary phylogenetic trees, since rSPR involves maintaining the binary
property of a phylogenetic tree \37§ and the leaves of a phylogenetic tree are labeled. (In contrast, 
here we have unlabeled trees that are not binary.) 

To show that MCPDT is hard, we reduce from Minimum Common Integer Partition. 

A partition of a positive integer $n$ is a multiset of positive integers that add up to exactly $n$. For
example, $\{3, 2, 2, 1\}$ is a partition of 8. A partition of a multiset $S$ is a multiset union of partitions
of the integers in $S$. A multiset $X$ is a common partition of two multisets $S_1, S_2$ if it is an integer
partition of both $S_1$ and $S_2$. For example, given $S_1 = \{8, 5\}$, $S_2 = \{9, 4\}$, $X = \{5, 3, 2, 2, 1\}$ is a
common partition of $S_1, S_2$.

The Minimum Common Integer Partition (MCIP) Problem: Given two multisets of
integers $S_1, S_2$, find the common integer partition of $S_1$ and $S_2$ with minimum cardinality.

Our result will rely on the following fact, proven in [9].

Fact 5.1. MCIP is APX-hard. 
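To make the definitions concrete, the following sketch checks whether a multiset is a partition of another multiset and finds a minimum common partition by exhaustive search. It runs in exponential time and is meant only for tiny instances such as the examples in the text; all function names are hypothetical.

```python
def is_partition_of(parts, targets):
    """True if the multiset `parts` splits into groups summing to `targets`."""
    if sum(parts) != sum(targets):
        return False
    parts = sorted(parts, reverse=True)
    remaining = list(targets)

    def place(i):
        if i == len(parts):
            return True  # totals agree, so every target is filled exactly
        tried = set()
        for j in range(len(remaining)):
            cap = remaining[j]
            if cap >= parts[i] and cap not in tried:
                tried.add(cap)  # bins with equal capacity are interchangeable
                remaining[j] -= parts[i]
                if place(i + 1):
                    return True
                remaining[j] += parts[i]
        return False

    return place(0)


def integer_partitions(n, largest=None):
    """Yield every partition of n as a non-increasing tuple."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


def min_common_partition(s1, s2):
    """Smallest common partition of s1 and s2, or None if the sums differ."""
    if sum(s1) != sum(s2):
        return None
    best = None

    def extend(idx, chosen):
        nonlocal best
        if best is not None and len(chosen) >= len(best):
            return  # cannot beat the best solution found so far
        if idx == len(s1):
            if is_partition_of(chosen, s2):
                best = list(chosen)
            return
        for p in integer_partitions(s1[idx]):
            extend(idx + 1, chosen + list(p))

    extend(0, [])
    return best


example_ok = is_partition_of([5, 3, 2, 2, 1], [8, 5]) and is_partition_of([5, 3, 2, 2, 1], [9, 4])
smallest = min_common_partition([8, 5, 1], [10, 4])
```

For $S_1 = \{8, 5, 1\}$ and $S_2 = \{10, 4\}$ the search returns a common partition with four parts.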

We now reduce MCIP (Minimum Common Integer Partition) to MCPDT (Minimum Cut/Paste
Distance between Trees) with an L-reduction [25].

Reduction: Given $S_1 = \{n_1, n_2, \ldots, n_p\}$, $S_2 = \{m_1, m_2, \ldots, m_q\}$, we construct two trees $T_1, T_2$ with
roots $r_1, r_2$ such that for each $1 \le i \le p$ (resp. $1 \le j \le q$) there is a distinct path from $r_1$ (resp.
$r_2$) of length $n_i$ (resp. $m_j$). For an example, see Figure 2. This means that we need to cut $T_1, T_2$ into
a common forest in which each tree, except the ones containing $r_1, r_2$, is a path.




Figure 2: An example for the reduction. Let $S_1 = \{8, 5, 1\}$ and $S_2 = \{10, 4\}$. The optimal number
of cut/paste operations is 2.
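The trees used in the reduction can be generated with a few lines of Python. This sketch assumes that the length of a path is measured in edges and encodes non-root nodes as `(path_index, depth)` pairs; both choices, like the function name, are illustrative assumptions.

```python
def spider_tree(lengths, root="r"):
    """Parent->child edges of a rooted tree with one path per entry of `lengths`.

    Each path starts at the root and has exactly n edges for its entry n.
    """
    edges = []
    for i, n in enumerate(lengths):
        prev = root
        for depth in range(1, n + 1):
            node = (i, depth)
            edges.append((prev, node))
            prev = node
    return edges


# The instance of Figure 2.
t1 = spider_tree([8, 5, 1])
t2 = spider_tree([10, 4])
```

Both trees have 14 edges, since the two multisets have the same sum; they differ only in how those edges are distributed over the paths leaving the root.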



Proposition 5.2. MCPDT is APX-hard. 

Proof. We prove that the reduction given above is an L-reduction. Let opt(MCIP) and $A$(MCIP)
be the values of the optimal and approximate solutions, respectively, of our instance of MCIP.
Let $\min = \min\{p, q\}$. Then the value of the optimal solution for our instance of MCPDT is
opt = opt(MCIP) $-$ min. In other words, we can conclude that $S_1, S_2$ have a common integer partition
of size $k$ if and only if $T_1, T_2$ each can be cut into a common forest with $k - \min$ cuts.

Moreover, $A$(MCPDT) = $A$(MCIP) $-$ min is the value of a feasible solution of MCPDT. Clearly
we have

1. opt $\le \alpha \cdot$ opt(MCIP), by setting $\alpha = 1$.

2. |opt(MCIP) $-$ $A$(MCIP)| $\le \beta \cdot$ |opt $-$ $A$(MCPDT)|, by setting $\beta = 1$.

To see the second claim, notice that

|opt $-$ $A$(MCPDT)| = |opt $-$ ($A$(MCIP) $-$ min)|
= |opt(MCIP) $-$ $A$(MCIP)|.

Therefore, this reduction is an L-reduction. As MCIP is APX-hard, MCPDT is also APX-hard. □

This result implies that unless P=NP, there is no PTAS for MCPDT, and so the pedigree edit
distance problem is APX-hard in general.

Theorem 5.3. The pedigree edit distance problem is APX-hard. 

For leaf-labeled trees, it is well known that the cut/paste distance can be computed in
polynomial time. However, that algorithm does not work for leaf-labeled
pedigree graphs. Next we establish the hardness of the leaf-labeled edit distance.



5.3 Compatibly Leaf-Labeled Pedigree Edit Distance is NP-Hard 

To prove the hardness of the compatibly leaf-labeled edit distance problem, we will take an instance
of the edit distance problem without leaf labels (i.e. $X = \emptyset$), and create an instance of the
compatibly leaf-labeled edit distance problem whose solution allows us to compute the edit distance for
the original edit distance instance.

Theorem 5.4. The leaf-labeled pedigree edit distance problem is NP-hard. 

Proof. Given non-labeled pedigrees $\mathcal{P} = (P, s)$ and $\mathcal{P}' = (P', s')$, we define compatibly leaf-labeled
pedigrees $\mathcal{Q} = (Q, X, t, \ell)$ and $\mathcal{Q}' = (Q', X', t', \ell')$ as follows. $\mathcal{Q}$ is obtained from $\mathcal{P}$ by adding,
for each individual $u \in I(P)$, an individual $u'$ of the opposite gender, and, for each individual
$v \in I(P')$, a new individual $i_{u,v}$ which is the child of $u$ and $u'$. $\mathcal{Q}'$ is obtained from $\mathcal{P}'$ similarly: for
each individual $v \in I(P')$ and $u \in I(P)$, we create an individual $j_{v,u}$ in $I(Q')$ which is the child of
$v$ and $v'$, where $v'$ is a founder individual of the opposite gender as $v$, also added to $I(Q')$. Now $\mathcal{Q}$
and $\mathcal{Q}'$ have leaf sets $\{i_{u,v}\}$ and $\{j_{v,u}\}$, respectively. Let $\ell$ be defined arbitrarily on $\{i_{u,v}\}$, and let
$\ell'(j_{v,u}) = \ell(i_{u,v})$. Then $\mathcal{Q}$ and $\mathcal{Q}'$ are compatibly leaf-labeled pedigrees. The following proposition
completes the proof of the theorem:

Proposition 5.3. $D_{\mathcal{P},\mathcal{P}'} = D_{\mathcal{Q},\mathcal{Q}'} - 2\left(|I(P)||I(P')| - \min\{|I(P)|, |I(P')|\}\right)$

Proof. Without loss of generality, we assume that $|I(P)| \le |I(P')|$. Suppose that $D_{\mathcal{P},\mathcal{P}'} = d$. Then
there is a matching $M$ that achieves this distance and we may assume that $M$ is defined on
all of $I(P)$, because if not we can extend $M$ arbitrarily to all of $I(P)$ without changing the match
distance.

$M$ extends uniquely to a matching $N$ defined on all of $I(Q)$ that respects the labels of the added
leaf individuals and such that $N(u') = M(u)'$. We now note that, for every individual on which $M$
is defined, $N$ will have exactly one additional well-matched edge. Therefore, $|W_N| = |W_M| + |I(P)|$.
We also have $|E(Q)| = |E(P)| + |I(P)||I(P')|$ and $|E(Q')| = |E(P')| + |I(P)||I(P')|$. This gives that
$D_{\mathcal{Q},\mathcal{Q}'} \le d + 2|I(P)||I(P')| - 2|I(P)|$ by Definitions 2.4 and

Now suppose that $D_{\mathcal{Q},\mathcal{Q}'} = d$. This means that there is a matching $N$, defined again without
loss of generality on all of $I(Q)$, that achieves this edit distance. If $N$ does not take every $u' \in I(Q)$
to $N(u)' \in I(Q')$, we can modify it so that this is the case, since this can only increase the number
of well-matched edges of $N$. Once this is established, the same argument used above shows that,
if we define $M$ to be the restriction of $N$ to $I(P)$, then $D_{\mathcal{P},\mathcal{P}'} \le d - 2|I(P)||I(P')| + 2|I(P)|$. This
completes the proof of the proposition. □

□

6 Simulation Results



We evaluated the general randomized heuristic (Section 4.4) and the dynamic programming algo- 
rithm (Section 4.2) against a brute-force branch-and-bound algorithm that finds the correct edit 
distance on general pedigrees in time exponential in pedigree size. We used the modified dynamic 
programming algorithm for the case where an upper bound on the distance is unknown: we chose a 
reasonable upper bound, and if the algorithm failed to find a matching at a given step (because the 
distance between the two pedigrees was too large), it ran the randomized heuristic instead. Our 
simulations were performed on small, three-generation pedigrees so that the edit distance could be 
computed using the exponential-time branch-and-bound algorithm. 

From the simulations, it appears that the heuristic algorithm provides a reasonable estimate of
the edit distance, especially when the two pedigrees being compared are very similar to each other.
The DP algorithm provides the correct answer when the two pedigrees are similar, a reasonably close
answer when the pedigrees are not very similar, and an answer that matches the heuristic algorithm
when the pedigrees are very different. Of course, these results depend on the parameter $k$ we
chose.



The simulation. We first drew a leaf-labeled pedigree $\mathcal{P} = (P, s, X, \ell)$ from a Wright-Fisher
simulation where every generation has a fixed number $2m$ of individuals, there is no inter-generational
mating, each monogamous couple has a number of offspring drawn from a Poisson distribution
with mean $\lambda$, and all leaves are labeled. We then randomly perturbed $\mathcal{P}$ to obtain $\mathcal{P}'$ by having
some fraction $x$ of non-founders choose a new parent of one gender uniformly and independently
at random. (Results obtained using a perturbation model that preserved monogamy were similar.)
Note that $\mathcal{P}$ and $\mathcal{P}'$ are always compatibly leaf-labeled.
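The perturbation step can be sketched as follows. The text leaves several details open, so this sketch makes explicit assumptions: exactly `round(x * n)` of the `n` non-founders are perturbed, the gender of the replaced parent is chosen by a fair coin, the new parent is drawn from the previous generation (consistent with the absence of inter-generational mating), and it may coincide with the old parent. The data layout and all names are hypothetical.

```python
import random


def perturb(parents, generation, gender, x, rng):
    """Return a copy of `parents` with one parent redrawn for a fraction x
    of the non-founders.

    parents: child -> (father, mother); founders do not appear as keys.
    generation: individual -> generation index (founders have index 0).
    gender: individual -> "m" or "f".
    """
    new_parents = dict(parents)
    non_founders = sorted(parents)
    count = round(x * len(non_founders))
    for child in rng.sample(non_founders, count):
        side = rng.choice((0, 1))  # 0 replaces the father, 1 the mother
        wanted = "m" if side == 0 else "f"
        pool = sorted(
            person
            for person, g in generation.items()
            if g == generation[child] - 1 and gender[person] == wanted
        )
        pair = list(new_parents[child])
        pair[side] = rng.choice(pool)
        new_parents[child] = tuple(pair)
    return new_parents


# Hypothetical two-generation pedigree with two founding couples.
gender = {"p1": "m", "p2": "m", "q1": "f", "q2": "f",
          "c1": "m", "c2": "f", "c3": "f"}
generation = {"p1": 0, "p2": 0, "q1": 0, "q2": 0, "c1": 1, "c2": 1, "c3": 1}
parents = {"c1": ("p1", "q1"), "c2": ("p1", "q1"), "c3": ("p2", "q2")}

unchanged = perturb(parents, generation, gender, 0.0, random.Random(0))
perturbed = perturb(parents, generation, gender, 1.0, random.Random(0))
```

Because only parent-child edges among internal individuals are rewired, the leaf labels are untouched, which is why the original and perturbed pedigrees remain compatibly leaf-labeled.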



Algorithms compared. We recorded the following measures of similarity for the pedigrees $\mathcal{P}$
and $\mathcal{P}'$.

1. Simulated Edit Path Length: $x$

2. Random-Matching Heuristic Estimate: $D_{\mathcal{P},\mathcal{P}'} / (|E(P)| + |E(P')|)$, where $D_{\mathcal{P},\mathcal{P}'}$ is the output
of the random-matching heuristic.

3. Normalized Edit Distance: $D_{\mathcal{P},\mathcal{P}'} / (|E(P)| + |E(P')|)$, where $D_{\mathcal{P},\mathcal{P}'}$ is the output of the
branch-and-bound algorithm.

4. DP Estimate: $D_{\mathcal{P},\mathcal{P}'} / (|E(P)| + |E(P')|)$, where $D_{\mathcal{P},\mathcal{P}'}$ is the output of the dynamic
programming algorithm, modified for the case where there is no guarantee on distance.

Remark. Notice that x is often larger than the edit distance because the edit path taken in the 
simulation was longer than the shortest edit path. 

Remark. Since our pedigrees were randomly generated and perturbed, in practice we could not
ensure the DP algorithm's condition that the pedigrees being compared be sufficiently similar at
each generation. Therefore we modified the algorithm to assume a reasonable upper bound $k = 8$
on this distance and to give the output of the random heuristic if no matching was found that met this
condition.

Figure 3: Comparing Different Distance Estimates. With $2m = 14$ and $\lambda = 3$, 2500 pairs of
pedigrees were simulated. Each point is an average of 50 simulations. The values of $n$
and $\lambda$ were chosen such that the branch-and-bound algorithm would finish computing the true edit
distance. The random-matching heuristic yields an estimated edit distance which is fairly close to
the true edit distance. The DP algorithm performs nearly perfectly for small numbers of actual
changes, while it returns the solution found by the random-matching heuristic when it cannot find
a solution for parameter $k = 8$. The left panel shows the accuracies of each algorithm. The right
panel shows the difference between the true edit distance and each of the distances returned
by the random-matching heuristic and the DP algorithm.

Simulation results. The three different results we recorded are plotted in Figure 3 against $x$, the
fraction of pedigree edges changed during simulation. Figure 3 also shows the difference between
the random-matching and true edit distances. Figure 4 shows the running times for the three
algorithms.

We see that the random-matching heuristic performs reasonably well, both in terms of accuracy and
time. The DP algorithm agrees with the true edit distance when that distance is small and agrees
with the random-matching estimate when the distance is large. However, in the intermediate range,
the DP algorithm produces an answer between the optimal and the heuristic value, because there
are matchings that satisfy the distance threshold at every generation which are not the optimal
matching, and the optimal matching contains a generation that fails the distance threshold. Due to
its accuracy, we recommend that the DP algorithm and the randomized heuristic be used together.
If a degree of accuracy is not needed, then we recommend using the linear-time leaf-labeled isomorphism
algorithm.

Figure 4: Running Times. These are box plots comparing the running times of the three different
algorithms: the heuristic algorithm, the branch-and-bound algorithm, and the DP algorithm. The heavy
line is the median, and the rectangle indicates the first and third quartiles. In this case the median is
coincident with the first quartile for all three algorithms. The outliers are not shown; specifically,
there are a number of very long execution times for the optimal algorithm.

7 Discussion

In this paper, we introduced two pedigree comparison problems, pedigree isomorphism and
pedigree edit distance, and we presented algorithms and hardness results for both. Several interesting
open questions remain:

• Fractional edit distance. An alternative definition of edit distance could be based on 
fractional matchings: instead of minimizing over all one-to-one matchings of vertices in the 
two pedigrees, we could allow a vertex of P to be matched to multiple vertices in P' with 
weights summing to one. Such a distance could be easier to compute, although the biological 
interpretation is less clear. It would also be interesting to explore the relationship between 
the definition presented in this paper and this alternate definition. 

• Pedigrees with inter-generational mating. Another open problem of interest is how the
edit distance algorithms work on pedigrees with inter-generational mating. The simulations
used here were based on the Wright-Fisher model and did not allow any inter-generational
mating events. It may be of interest to simulate the pedigrees using a birth-death model such
as the Moran model where inter-generational mating is allowed. Such a simulation would
allow the evaluation of the distance heuristics on non-regular pedigrees.

• Pedigree isomorphism without labels. A very interesting open problem is that of pedigree
isomorphism without labels. Since the graph isomorphism problem is reducible to it, it is
conceivable that existing algorithms for graph isomorphism might be of use for pedigree
graphs.

Comparison of pedigrees is an interesting and important problem. Here, we have taken the first 
steps towards understanding and solving it.